Netflix Culture

Introduction

Netflix is one of the most popular streaming services in the world, offering a wide range of movies and TV shows to its subscribers. Netflix is known for its foreign-language, genre-specific, and binge-worthy content. It provides its audience with high-quality, original productions, which is why it has been so successful in the market. Like every large company, Netflix relies on a huge amount of data (Big Data). As a major streaming company, it collects data about its subscribers' actions: what they watch the most, when they watch, and for how long. This data also powers its recommendation system. By analyzing subscribers' viewing history and behavior, Netflix offers content that each subscriber is most likely to be interested in. As a result, the audience stays engaged with the platform, which benefits the company itself.

The aim of our project

Since some of our group members are Netflix subscribers, we became interested in analyzing patterns in its TV shows and movies. We chose two datasets containing different types of information about Netflix titles, such as the names of the movies and TV shows, the production year, the producers, etc. Having this amount of data gave us the opportunity to analyze patterns in the content and provide visualizations demonstrating them more clearly. The purpose of this paper is to clean, analyze, and visualize this data, explaining our steps in detail. The language we used for all the processing is Python.

Datasets

As mentioned above, we have two datasets: netflix_titles.csv and imdb_top_1000.csv. Both were obtained from Kaggle.com, one of the largest data science communities, which provides reliable and useful resources. netflix_titles.csv contains unlabelled text data on around 9000 Netflix shows and movies, along with full details such as cast, release year, rating, and description. imdb_top_1000.csv is an IMDB dataset of the top 1000 movies and TV shows. In addition to the datasets, we used a GeoJSON file called countries.geojson in our project.

Data Cleaning

Below are all the libraries we imported.

  1. NumPy is a library for numerical computing in Python. It provides tools for working with arrays and matrices.
  2. Pandas is a library for data manipulation and analysis.
  3. Seaborn is a library for statistical data visualization. It provides a high-level interface for creating statistical graphics.
  4. Matplotlib is a library for creating static, animated, and interactive visualizations in Python. It provides tools for creating various types of plots.
  5. Wordcloud is a library for creating word clouds in Python. A word cloud is a visual representation of text data in which the size of each word is proportional to its frequency in the text.
  6. TextBlob is a Python library for processing textual data. It is built on top of the Natural Language Toolkit (NLTK) and provides a simple API for common natural language processing (NLP) tasks such as sentiment analysis, part-of-speech tagging, and noun phrase extraction.
  7. Finally, itertools is a library for working with iterators, which are objects that can be looped over. It provides tools for creating, combining, and manipulating iterators.

This is the first cell of our Python code. The first two lines of the cell create two pandas DataFrames, df and df2. Both lines read the CSV files and allow us to work with the data inside the datasets. The third line tells pandas to display all the columns of the DataFrames; without setting the option to None, pandas would limit the number of displayed columns by default.
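A minimal sketch of that first cell is shown below. Since the real CSV files are not available here, an inline CSV snippet read through io.StringIO stands in for netflix_titles.csv; the column names are illustrative.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for netflix_titles.csv.
csv_text = """show_id,type,title,release_year
s1,Movie,Example Movie,2019
s2,TV Show,Example Show,2021
"""

# The notebook's first cell does the equivalent of:
#   df = pd.read_csv("netflix_titles.csv")
#   df2 = pd.read_csv("imdb_top_1000.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Without this option pandas truncates wide DataFrames when displaying them.
pd.set_option("display.max_columns", None)

print(df.shape)  # (2, 4)
```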

Getting basic information about dataset

isnull().sum() is a pandas method chain used to find the number of missing values in each column of a DataFrame df. Missing values are entries such as NaN or None.

df["country"].fillna("MISSING", inplace=True) line fills the missing values in the "country" column of the dataset with the "MISSING", which is a string.

df["duration"].fillna("0 min", inplace=True) line fills the missing values in the "duration" column with "0 min".

df["director"].fillna("Unknown", inplace=True) line fills the missing values in the "director" column with "Unknown".

df["cast"].fillna("Unknown", inplace=True) line fills the missing values in the "cast" column with "Unknown".

df["date_added"].fillna("Unknown", inplace=True) line fills the missing values in the "date_added" column with "Unknown".

df["rating"].fillna("Unknown", inplace=True) line fills the missing values in the "rating" column with "Unknown".

df["duration"].fillna("Unknown", inplace=True) line fills the missing values in the "duration" column with "Unknown". (Since the "duration" column was already filled with "0 min" above, this line has no further effect.)

In all of the above cases the replacement values are strings. In each case we pass inplace=True to indicate that we want to make the changes on the original DataFrame instead of creating a copy. In the end we get a DataFrame with no missing (NaN) values.
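The filling steps above can be sketched on a tiny stand-in DataFrame (the column subset and values here are illustrative; column assignment is equivalent to fillna(..., inplace=True)):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Netflix DataFrame, with some missing values.
df = pd.DataFrame({
    "country": ["United States", np.nan],
    "director": [np.nan, "Jane Doe"],
    "duration": ["90 min", np.nan],
})

# Fill each column's missing values with a sentinel string.
df["country"] = df["country"].fillna("MISSING")
df["director"] = df["director"].fillna("Unknown")
df["duration"] = df["duration"].fillna("0 min")

# No missing values remain.
print(df.isnull().sum().sum())  # 0
```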

describe() is a method that returns a summary of the central tendency, dispersion, and shape of the distribution of the numeric columns of our DataFrame. In the output we can see the rows "count", "mean", "std", "min", "25%", "50%", "75%", and "max". These describe the columns of our DataFrame, already cleaned of missing values: count is the number of non-null values in each column, mean is the average, std is the standard deviation, min is the minimum value, 25% is the 25th percentile (similarly for 50% and 75%), and max is the maximum value of the column.

df.shape returns a tuple with the number of rows and the number of columns in the dataframe. In our case the number of the rows is 8807 and the number of columns is 12.

In pandas, the df.columns attribute returns the names of the columns. We can see the names of the columns of our dataset above, as well as the dtype of the resulting index, which is "object".

count() method returns the number of non-null values in each column of a DataFrame. This method can be used to quickly identify missing values in a Dataframe.

nunique() method returns the number of unique values in each column of a Dataframe.

Instead of explaining every line of the code, only the first line will be explained, as the others are identical apart from the column name.

print(f" dtype - show_id: {df.show_id.dtype}") prints the data type of the column. The f-string prefix lets us include the name of the column alongside its dtype in the output.

From the output we can see that we have 11 columns with type object and 1 column with type int64.

dropna() method is used to remove missing values from a DataFrame. The axis parameter specifies whether to remove rows or columns that contain missing values, and the how parameter sets the condition for removing a row or column.

Working with duration column and splitting it into 2 columns

unique() method is used to get an array of the unique values in a DataFrame column. In our case, unique() will return an array of the unique values in the "duration" column.

The above code splits the duration column into two parts: an integer value and a unit ("Seasons" or "min"). Each line is explained in the code cell.
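One way to sketch that split (the column names duration_int and duration_unit are illustrative; the notebook may use different names):

```python
import pandas as pd

df = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season"]})

# Split "duration" into a numeric part and a unit ("min", "Season", "Seasons").
parts = df["duration"].str.extract(r"(\d+)\s+(.+)")
df["duration_int"] = parts[0].astype(int)
df["duration_unit"] = parts[1]

print(df[["duration_int", "duration_unit"]])
```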

In this case unique() will return an array of unique values in the "rating" column.

Here, as we can see, we have a problem: three values from the "duration" column have been placed in the "rating" column. What we will do is simply get their indices and then assign them their true values. Those values in the "rating" column will be marked as "unknown", while in the "duration" column they will get their true values.
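The fix can be sketched as follows (the two misplaced rows here are hypothetical examples of the glitch, where a duration string such as "74 min" ended up in "rating"):

```python
import numpy as np
import pandas as pd

# Hypothetical rows reproducing the glitch.
df = pd.DataFrame({
    "rating": ["TV-MA", "74 min", "84 min"],
    "duration": ["90 min", np.nan, np.nan],
})

# Rows whose "rating" value looks like a duration:
bad = df["rating"].str.contains("min", na=False)

# Move the value to "duration" and mark "rating" as unknown.
df.loc[bad, "duration"] = df.loc[bad, "rating"]
df.loc[bad, "rating"] = "unknown"
```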

Number of TV Shows and Movies

Two new DataFrames, one for movies and one for TV shows, created from the original df, enable us to compare the counts of TV shows and movies independently.

Above, some basic analysis is performed to calculate the number of movies and TV shows in the dataset.

The percentage of movies and TV shows is calculated by dividing the number of movies, and the number of TV shows, by the total number of titles. The pie chart is then drawn with the labels "Movies" and "TV Shows" and matching colors using the matplotlib library. The pie chart makes it easier to see the distribution of movies and TV shows in the dataset.
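The percentage computation can be sketched as below; the toy "type" column is illustrative (the real df has 6131 movies and 2676 TV shows):

```python
import pandas as pd

# Toy "type" column standing in for the real one.
df = pd.DataFrame({"type": ["Movie"] * 7 + ["TV Show"] * 3})

counts = df["type"].value_counts()
total = counts.sum()
movie_pct = counts["Movie"] / total * 100
tv_pct = counts["TV Show"] / total * 100

# Passing [movie_pct, tv_pct] with labels ["Movies", "TV Shows"] to
# plt.pie(...) produces the chart described above.
print(movie_pct, tv_pct)  # 70.0 30.0
```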

Years when the highest/lowest number of movies/TV shows available on Netflix were produced.

This code examines the dataset to identify the years in which the most and the fewest of the TV shows and movies available on the streaming service were produced.
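A minimal sketch of that lookup, using value_counts() on the release year (the toy years are illustrative; in the real data the peak is 2018):

```python
import pandas as pd

# Toy release years standing in for df["release_year"].
df = pd.DataFrame({"release_year": [2018, 2018, 2018, 2016, 1961]})

year_counts = df["release_year"].value_counts()
most_productive = year_counts.idxmax()   # year with the most titles
least_productive = year_counts.idxmin()  # a year with the fewest titles

print(most_productive, least_productive)
```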

Rating System

The graph makes it easier to see how Netflix movies and TV series are rated. It shows that most of the Netflix selection of movies and TV series is rated TV-MA (for mature audiences), with TV-14 following closely behind. The graph also includes a small number of G-, PG-, and TV-G-rated films and television programs that are appropriate for younger viewers. Overall, this code offers useful information about the kinds of Netflix material on offer and the intended viewers for each rating group.

Wordcloud of movie titles

Let's create something interesting: a word cloud of the words used in each movie's title. What will this give us? We will find out which words were used the most in the titles of more than 8000 movies.

Based on the titles of the films and TV series in the dataset, this code creates a word cloud. The WordCloud class from the wordcloud library is used for this purpose. Those words that appear most frequently in the names of the films and TV series in the dataset are represented visually in the resulting word cloud.
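WordCloud sizes each word by its frequency; the underlying counts can be sketched directly with a Counter (the titles here are invented examples, not from the dataset):

```python
from collections import Counter

# Hypothetical titles; the real code feeds df["title"] to WordCloud.
titles = ["Love Actually", "Love and Life", "Christmas Love"]

# Count word occurrences across all titles, case-insensitively.
words = " ".join(titles).lower().split()
freq = Counter(words)

print(freq.most_common(1))  # [('love', 3)]
```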

Getting Top directors (w/ >= 10 movies)

This code creates a bar plot showing the number of films directed by each director (or group of directors, if they worked on the same project together) in the dataset, taking into account only those who have directed at least 10 films. The popular_directors variable is used to filter out only these directors, and the value_counts() function is used to count the number of films each one has directed. Using a colormap, the bars' colors are determined by the number of films each director has directed. The plot is then constructed using the popular_directors.plot() method with appropriate labels, a clear title, and label rotation. The plot makes it easier to see which directors have the most Netflix films.
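The filtering step can be sketched like this; the toy director names are placeholders, and the threshold is lowered to 2 so the tiny example has survivors (the notebook uses 10):

```python
import pandas as pd

# Toy "director" column with single-letter placeholder names.
df = pd.DataFrame({"director": ["A", "A", "A", "B", "B", "C"]})

# Count films per director, then keep only the "popular" ones.
counts = df["director"].value_counts()
popular_directors = counts[counts >= 2]

# popular_directors.plot(kind="bar") would then draw the bar chart.
print(popular_directors.to_dict())  # {'A': 3, 'B': 2}
```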

Duration distribution

We display the dataset's distribution of movie runtimes here. To extract the duration values, we first filter the DataFrame to only include the rows representing movies (not TV shows). Then, using the plt.hist() function with 25 bins, we plot a histogram of the durations. The figure shows that the majority of the dataset's films have running times between 70 and 120 minutes, peaking around 90 minutes. A very small number of movies are longer than 200 minutes, but they do exist. This plot can give us a general idea of how long Netflix movies tend to be, which can be helpful for content producers who wish to make movies that appeal to the platform's customers.
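The filter-then-bin step can be sketched with np.histogram, which performs the same binning plt.hist does before drawing (the toy durations and the duration_int column name are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: runtimes in minutes, already split out of "duration".
df = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show"],
    "duration_int": [85, 95, 110, 2],
})

# Keep only movies, then bin the runtimes.
movie_durations = df.loc[df["type"] == "Movie", "duration_int"]
counts, edges = np.histogram(movie_durations, bins=[60, 90, 120, 150])

print(counts)  # [1 2 0]
```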

Just like the previous one, we display the durations of TV shows. This histogram indicates that most Netflix TV programs have between 1-4 seasons, with 1-2 seasons having the highest frequency. A significant percentage of TV series have run times of 5–10 seasons, while a much smaller proportion have run times of more than 10.

Average movie duration over time

According to the plot, the average movie runtime has varied over the years, generally growing since the early 2000s. It should be emphasized that this analysis only takes films into account and ignores TV shows. As a result, the plot offers only a partial view of the overall trend in Netflix content duration.

Sentiment Analysis

Sentiment analysis is used to understand the emotion or attitude behind a text. In our case, by applying this analysis to the "Description" column, we can determine whether the content on Netflix is mainly positive or not.

For this, we import the TextBlob library. Next, we use the sentiment property, which has two components: polarity and subjectivity. We will focus on the polarity part. It returns a number from -1 to 1: if it is > 0, the content is positive; if it is < 0, negative; and if it is equal to zero, the content is neutral.
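The classification step can be sketched as below, assuming a polarity column has already been computed (e.g. with TextBlob(text).sentiment.polarity for each description); the polarity values here are invented:

```python
import pandas as pd

# Hypothetical precomputed polarity scores.
df = pd.DataFrame({"polarity": [0.5, -0.2, 0.0, 0.8]})

def classify(p):
    # > 0 positive, < 0 negative, == 0 neutral, as described above.
    if p > 0:
        return "positive"
    if p < 0:
        return "negative"
    return "neutral"

df["sentiment"] = df["polarity"].apply(classify)
print(df["sentiment"].value_counts().to_dict())
```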

As we can see, over the years the movies on Netflix have become more positive, since the number of movies with positive content has increased significantly.

Movies distribution among countries

The Counter object counts the number of occurrences of each country in the list, and returns a dictionary-like object where the keys are the countries and the values are the counts.

The keys() method is used to extract the keys (i.e., the country names) from the dictionary country_counts_dict.

The values() method is used to extract the values (i.e., the counts) from the dictionary country_counts_dict.

The code above creates a dictionary, where the keys represent the countries and the values show how many movies were produced in each country
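The counting step can be sketched as follows; since a title can list several comma-separated countries, each entry is split before counting (the rows here are illustrative):

```python
from collections import Counter

# Toy "country" values standing in for df["country"].
rows = ["United States", "United States, India", "India"]

# Split multi-country entries and flatten into one list.
countries = []
for entry in rows:
    countries.extend(c.strip() for c in entry.split(","))

# Dictionary-like object: keys are countries, values are counts.
country_counts = Counter(countries)
print(dict(country_counts))
```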

New Dataset - IMDB TOP 1000

Our second dataset includes the top 1000 IMDB movies. We do not use it separately; instead, we are going to use it with our original "netflix movies" dataset. We will first merge these two datasets to find the list of those Netflix movies which were classified as top movies on IMDB, and then we will work with this new dataset. But first, we will get basic information about the second, "IMDB Top 1000" (df2), dataset, using the same functions that were used for the first dataset.

Basic information about the Dataset

In order to be able to merge these datasets by the "title" of the movie, we modified the column name for df2 to match the title column name of df.

For now, we will change the dtype of the "Released_Year" column of df2, since it is of dtype("O"). We will convert it to Int64. A little later, we will see why we did this.

We merged the datasets, and now it seems that everything should be fine. However, there were cases where two DIFFERENT movies, produced by different directors and in different years, had the SAME name. Therefore, we should try to identify these "wrong" matches. We will do this by comparing the "Released_Year" column of df2 with the "release_year" column of df. Since the dtypes of these two columns were different, the comparison could have produced mistakes; that's why we modified the type of the "Released_Year" column a few lines above.
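The rename-convert-merge-filter sequence can be sketched as below; the two same-titled movies are invented, and the Series_Title column name is assumed for the IMDB dataset:

```python
import pandas as pd

# Two different movies sharing one title (hypothetical example).
df = pd.DataFrame({
    "title": ["Drive", "Drive"],
    "release_year": [2011, 1997],
})
df2 = pd.DataFrame({
    "Series_Title": ["Drive"],
    "Released_Year": ["2011"],  # dtype object ("O") in the raw file
})

# Match the column names, then make the year columns comparable.
df2 = df2.rename(columns={"Series_Title": "title"})
df2["Released_Year"] = pd.to_numeric(df2["Released_Year"]).astype("Int64")

merged = df.merge(df2, on="title")

# Same title, different movie: keep only rows where the years agree.
merged = merged[merged["release_year"] == merged["Released_Year"]]
print(merged)
```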

After a couple of easy steps, we now have the correct dataset: the dataset of Netflix movies which were included in the IMDB Top 1000 movies.

Finding correlation between MetaScore and IMDB Rating

Hmm, both the Metascore and the IMDB rating are used to indicate the quality of a movie. Then what is the difference between them? And why should we look for a link between them? While both are used to rate a movie, the Metascore is a weighted average of critic reviews from a variety of publications, including newspapers, magazines, and online review sites. The scores are assigned on a scale of 0-100, and the weighted average is used to calculate the Metascore. On the other hand, the IMDb (Internet Movie Database) rating is given by registered users of the IMDb website. Users can rate movies and TV shows on a scale of 1 to 10, and the IMDb rating is calculated as the average of all user ratings. This means that the Metascore is a more objective estimation of a movie than the IMDB rating.

Now let's find out whether the ratings of critics and users are similar or not. For this, we will plot a graph which shows the relation between them.

For the highest-rated Netflix movies, the correlation between user and critic reviews is calculated and shown. The correlation between the Meta_score and IMDB_Rating columns is first determined using pandas' corr() function. The coefficient, rounded to two decimal places, and a description of the association are then printed. The plot shows that there is a weak positive correlation between the Meta Score and the IMDB Rating, indicating that knowing one of them does not let us draw firm conclusions about the other.
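The correlation step itself is one line of pandas; the toy scores below are perfectly linear, so the coefficient comes out to 1.0 (the real data gives a much weaker value):

```python
import pandas as pd

# Toy scores; the real columns are Meta_score and IMDB_Rating.
df = pd.DataFrame({
    "Meta_score": [60, 70, 80, 90],
    "IMDB_Rating": [7.0, 7.5, 8.0, 8.5],
})

# Pearson correlation between the two columns, rounded to 2 decimals.
corr = df["Meta_score"].corr(df["IMDB_Rating"])
print(round(corr, 2))  # 1.0 for this perfectly linear toy data
```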

Correlation between Number of Votes and IMDB Rating

The relationship between the number of votes and the IMDB rating of the highest-rated Netflix movies and TV series is examined here. There is a moderate positive link between user ratings and the total number of votes: the correlation coefficient between the two variables comes out to 0.59. The link between the two variables is then depicted using a scatter plot of the data points and a regression line. We can see an upward trend, which supports our expectation that the IMDB rating rises as the number of votes does.

Genre distribution (in netflix movies that are in IMDB top 1000 movies list)

We use a pandas series to count the instances of each genre in the list, and a horizontal bar chart is used to visualize the results. The resulting graph illustrates the most common genres in the highest-rated films along with the quantity of each genre's films. It offers information about the tastes of the viewers and raters of these films. In this instance, we can observe that Drama, Action, Comedy, Adventure, and Crime are the most popular genres.

Top 10 Actors (in netflix movies that are in IMDB top 1000 movies list)

We pick the top 10 actors from the actors_count Series who have appeared in the most top-1000 IMDb-rated movies, and we plot a horizontal bar graph with their counts. The plot, made with matplotlib, shows the top 10 actors on the y-axis and the number of films they have appeared in on the x-axis. The horizontal bar graph indicates that Aamir Khan, Robert De Niro, and Mark Ruffalo are the top 3 actors whose films are most commonly ranked in the top 1000 IMDb ratings.

WordCloud of Actors' names

AGAIN WORDCLOUDS! They are beautiful, aren't they? Now let's create a word cloud with the actors' names in it. The actors whose films are in the top 1000 IMDb ratings are grouped into a word cloud here. The actors' names that appear most frequently in the dataset are displayed in the resulting word cloud, with bigger fonts denoting more occurrences in the list. It provides a visual representation of the most well-known actors in the top-1000 IMDb-rated Netflix films.

Collaborations between Directors and actors.

What if we want to know which famous directors work with which actors? What if we want to get some insights about their collaborations? After all, the success of a project comes from successful collaboration between the actors and the director. Then what we definitely need to do is create a collaboration graph for directors and actors. However, since our data is huge, with a lot of directors and actors, the picture will not be clear if we do this for the whole dataset. Thus, we will create the graph for the collaborations between the top 10 directors and the actors they worked with.

The collaboration between the top 10 directors and the actors they have worked with in films with top-1000 IMDB ratings is displayed using a network visualization created with the NetworkX library. First, a fresh NetworkX graph is produced with G = nx.Graph(). Each director and actor is added as a node with a "director" or "actor" attribute using G.add_nodes_from(). Edges are added to the graph with G.add_edge(), each of which represents a collaboration between a director and an actor. Nodes are colored differently based on their attribute (directors are red, actors are pink), and lines are drawn between the nodes to signify collaboration. The network is then displayed using nx.draw_networkx(). The resulting graphic is a network of partnerships between the top 10 directors and the actors in films with top-1000 IMDB ratings.

As we can see from the graph, the red dots represent the 10 directors, while the smaller pink dots are the actors they collaborated with. Each edge shows that a particular director-actor pair worked together.
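The edge-building step that feeds G.add_edge(...) can be sketched in plain pandas; the rows below are invented, and only two star columns are assumed here for brevity (imdb_top_1000.csv actually has Star1 through Star4):

```python
import pandas as pd

# Toy rows standing in for the merged top-directors data.
df = pd.DataFrame({
    "Director": ["Christopher Nolan", "Quentin Tarantino"],
    "Star1": ["Christian Bale", "John Travolta"],
    "Star2": ["Michael Caine", "Uma Thurman"],
})

# Build the (director, actor) pairs; in the NetworkX version each
# pair becomes G.add_edge(director, actor).
edges = []
for _, row in df.iterrows():
    for star_col in ["Star1", "Star2"]:
        edges.append((row["Director"], row[star_col]))

print(len(edges))  # 4
```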

Grouping the movies based on their IMDB Rating

We group the films in netflix_top_imdb according to their IMDB rating and print the titles of the films in each group. Within each group, the movies are arranged in descending order of their IMDB ratings. This offers a quick way to view which movies fall into each rating category and how their IMDB ratings compare with one another.
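One way to sketch such grouping is with pd.cut; the bin edges, labels, and titles below are hypothetical examples, not the notebook's actual boundaries:

```python
import pandas as pd

# Toy titles and ratings.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "IMDB_Rating": [7.6, 8.1, 9.0],
})

# Hypothetical rating bins.
bins = [7.5, 8.0, 8.5, 9.5]
labels = ["7.6-8.0", "8.1-8.5", "8.6-9.5"]
df["rating_group"] = pd.cut(df["IMDB_Rating"], bins=bins, labels=labels)

# Print each group's titles, highest-rated first within the group.
ordered = df.sort_values("IMDB_Rating", ascending=False)
for group, titles in ordered.groupby("rating_group", observed=True):
    print(group, list(titles["title"]))
```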

Movie Search System

The user is asked to input answers to some questions for a more effective search. If they don't have a specific preference, they can type "skip".

This code is part of a movie search system that matches user preferences with movies in a dataset to generate a list of recommended films. The process takes user preferences for the type of movie, director, country, release year, and duration as input and creates a list of all the user preferences. It then filters the list so that no "skip" value remains, counts how many preferences are stored in the list, and calculates how much each inputted preference contributes to the final score. The system then creates a new column in the DataFrame indicating each movie's similarity percentage to the user's search. It sets a condition for each user preference and increments the percentage column for each matched movie in the DataFrame. Finally, the system creates a new DataFrame containing the recommended movies, sorted by the percentage column.
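The matching logic described above can be sketched as follows; the movies, the preference answers, and the column subset are all invented for illustration, and exact equality stands in for whatever matching conditions the real system uses:

```python
import pandas as pd

# Toy catalogue.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "type": ["Movie", "Movie"],
    "director": ["Jane Doe", "John Roe"],
    "country": ["United States", "India"],
})

# Hypothetical user answers; "skip" means "no preference".
preferences = {"type": "Movie", "director": "skip", "country": "India"}

# Drop skipped questions and give each remaining one an equal weight.
active = {k: v for k, v in preferences.items() if v != "skip"}
weight = 100 / len(active)

# Each matched preference adds its share to the similarity percentage.
df["percentage"] = 0.0
for column, value in active.items():
    df.loc[df[column] == value, "percentage"] += weight

recommended = df.sort_values("percentage", ascending=False)
print(recommended[["title", "percentage"]])
```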

Conclusion

  1. The Netflix dataset consists of 8807 rows and 12 columns.
  2. Netflix has 6131 movies and 2676 TV shows.
  3. Movies make up 69.6% of all titles, and TV shows 30.4%.
  4. 2018 is the year in which the greatest number of the movies available on Netflix were produced.
  5. The fewest of the movies available on Netflix are from 1961, 1959, 1925, 1966, and 1947.
  6. Starting from 2011, the number of movies published yearly drastically increased. However, it started to decline after 2018, until the latest year available in this dataset: 2021.
  7. Most of the Netflix selection of movies and TV series is rated TV-MA (for mature audiences), with TV-14 following closely behind. A very small number of movies/TV shows have the ratings NC-17, UR, and TV-Y7-FV.
  8. The words most frequently used in the titles of Netflix movies/TV shows are Love, Life, Girl, Christmas, and World.
  9. The director with the greatest number of movies is Rajiv Chilaka.
  10. The majority of the dataset's films have running times between 70 and 120 minutes, peaking around 90 minutes. A very small number of movies are longer than 200 minutes, but they do exist.
  11. Most Netflix TV programs have between 1 and 4 seasons, with 1-2 seasons having the highest frequency. A significant percentage of TV series run for 5-10 seasons.
  12. The durations of the movies available on Netflix are the greatest for the years 1960-1967. Even though Netflix did not exist at that time, its platform contains movies and shows from that period.
  13. Over the years, the number of movies with positive content increased in comparison with movies with neutral and negative content.
  14. The countries where most Netflix movies/TV shows were produced are the United States (3690), India (1046), and the United Kingdom (806).
  15. There are 144 Netflix movies that are included in the IMDB Top 1000 rated movies.
  16. There is a weak positive correlation between IMDB ratings and Metascores. That means that, given the Metascore, we cannot predict the IMDB rating.
  17. The actors with the greatest number of movies are Aamir Khan, Leonardo DiCaprio, and Mark Ruffalo. (This information is demonstrated using both the word cloud and the bar plot.)
  18. We visualized the collaborations between the top 10 directors and the actors they worked with.
  19. We created a mechanism to filter movies based on users' preferences (type, duration, director, country, etc.).